-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add training service #225
Add training service #225
Conversation
bbf0243
to
ace0793
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #225 +/- ##
==========================================
- Coverage 64.60% 64.48% -0.12%
==========================================
Files 40 47 +7
Lines 2195 2689 +494
==========================================
+ Hits 1418 1734 +316
- Misses 777 955 +178 ☔ View full report in Codecov by Sentry. |
ace0793
to
cbb2619
Compare
@@ -1,7 +1,7 @@ | |||
[pytest] | |||
python_files = test_*.py | |||
addopts = | |||
--timeout 10 | |||
--timeout 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference in creating threads, processes using a start method "spawn" instead of "fork" is quite significant, that led me to bump it, so the tests can pass for macos and windows platforms.
e9fc4b0
to
f345f69
Compare
750ee88
to
d432e25
Compare
- Supported operations: start, resume, pause, shutdown - pytorch-3dunet package is used as the framework to create the models
… failed I caught an edge case, where events are blocked, because we have exited the training, and the tasks of the queue would remain unprocessed.
Creating and closing processes and threads can be quite time consuming resulting to test timeouts if the tests performs a lot of actions.
Applying monkeypatch to a parent process won't propagated to a child process if the start method is spawn (macos) instead of fork (linux)
- To fix test on windows, convert label data to float64
The should stop callbacks are boolean, so we need to aggregate their return value. Previously the return value wasn't taken into account, and the callbacks were returning none
The enum is used for validation check before triggering one of them. Previously I was checking if the queue was alive, but that won't be enough, for example if you want to perform resume, while you are resumed, the queue is operational, but the action shouldn't be valid.
b9153b1
to
a088627
Compare
09fcb85
to
4a5ed85
Compare
52fb50f
to
752cb59
Compare
Move NamedInt and Tensor proto to a separate file so training proto can use as well
- The inference servicer had a procedure to list the available devices. This is needed or the training servicer as well. So list devices is decoupled to be shared.
752cb59
to
27b3923
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @thodkatz, this is exciting! Also thank you for being available in DMs/calls for some questions. After looking at the proposed changes I have to say that you followed the original design through the whole onion nicely.
The tests look also pretty thorough and instill confidence, especially given that we target linux, mac, and windows platforms.
Seeing how little you had to adapt the pytorch-3dunet
trainer class gives me confidence, that it was a good decision to go for it.
--> LGTM
Hey @k-dominik ! Thanks a lot for the kind words :) It is true that the onion kind of structure needs to be revised and simplified, totally agree with that. I hope the tests will help with the refactoring part. Of course, I am always available :) |
An initial implementation of adding neural network training in tiktorch :P
It requires:
The package
pytorch-3dunet
is used as our framework to configure the models.Supported workflows:
The above are demonstrated with tests. Please check them first :)
For thread synchronization we rely on a thread-safe priority queue (todo: explain the scheme with the concept of ICommand, and the error handling).
Manual testing
I have created a setup to test the server with actual data. You can find more here: https://github.com/thodkatz/ilastik-playground